Ultra Brief Manual of Biostatistics using ‘R’: JK (Nov 2018) 

1) Download and install R (version 2.11.1 or higher) for your PC's operating system
(Linux, Windows, or Macintosh) from: https://cloud.r-project.org/

Or from any other mirror listed at: https://cran.r-project.org/mirrors.html

2) Install R Studio for your PC's operating system from the following link:
https://www.rstudio.com/products/rstudio/download/

3) Launch R Studio 

4) Click the "Tools" menu in the menubar at the top and select the "Install Packages" option. When
the dialog box appears, type Rcmdr (without quotation marks) in the box and make sure that the
"install dependencies" box is checked. Then click "install".

5) All other relevant or missing packages can be installed in the same way (by typing the
standard short name of the package). R is case sensitive, so upper case and lower case
matter when spelling package names.

6) To do statistical analysis, launch R Studio, then select "Packages" from the menu bar in the
lower right sector window and type Rcmdr in the search bar. Once Rcmdr is listed, load it by
checking its box. Install (optional) any dependencies you are prompted for while Rcmdr is launching.

If Rcmdr shows warning messages about missing packages (e.g. sem, leaps) while loading, and these
packages fail to compile or install automatically, try installing the missing packages manually
(optional). In the Rcmdr window, go to "Tools", then "Load package(s)"; then select all the packages
and click OK to load all available packages.
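
The menu steps in 4) to 6) can also be carried out from the R console. The following is a minimal sketch, not the only way to do it; `missing_packages` is a hypothetical helper name introduced here for illustration, and the list of package names is an assumption based on the packages this manual mentions.

```r
# Packages mentioned in this manual ("pwr" is only needed for power analysis).
needed <- c("Rcmdr", "pwr")

# Hypothetical helper: return the entries of `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(needed)

# Only install and launch in an interactive session (e.g. the R Studio console).
if (interactive() && length(to_install) > 0) {
  # Console equivalent of Tools -> Install Packages with
  # "install dependencies" checked.
  install.packages(to_install, dependencies = TRUE)
}

if (interactive() && length(missing_packages("Rcmdr")) == 0) {
  # Attaching Rcmdr opens the R Commander window.
  library(Rcmdr)
}
```

Package names are case sensitive here as well: `install.packages("rcmdr")` will not find Rcmdr.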

(a) Data for statistical analysis should preferably be in .csv or .xls format.

(b) The names of the variables or factors must be in the first row of the .csv file or the .xls
spreadsheet.

(c) Each subsequent row contains the data for one series, item, or person.

(d) All the data for a given variable / factor must be in consecutive rows in the column of
that particular variable name or factor name. E.g. all data of "weight" must come under the
"weight" column in consecutive rows; all data on "group code" must come under the "group
code" column in consecutive rows.

(e) Importing data into Rcmdr (R Commander): Click "Data" menubar -> "Import data" ->
"from text file, clipboard, or URL" -> Choose Location of data file as: Local file system;
Choose field separator as: Commas [,] -> click OK to open the file browser window ->
select the .csv file and click "open" to import data of the .csv file.
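
The data layout in (a) to (d) and the import in (e) can be sketched at the R console with `read.csv`. The file contents and column names below are made up for illustration; the example writes its own temporary file so that it runs anywhere.

```r
# Write a small example file so the sketch is self-contained.
# Column names and values are invented for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,group,weight",   # first row: variable / factor names (point b)
  "1,A,61.2",          # one row per subject (point c)
  "2,A,58.7",
  "3,B,70.4",
  "4,B,66.9"
), csv_path)

# header = TRUE: the first row holds the variable names.
# sep = ",": the field separator is a comma, as chosen in the Rcmdr dialog.
# stringsAsFactors = TRUE: text columns such as `group` become factors.
dat <- read.csv(csv_path, header = TRUE, sep = ",", stringsAsFactors = TRUE)

str(dat)  # shows each column with its type
```

For your own data, replace `csv_path` with the path to your .csv file.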

(f) Basic Statistics: 

i. For Descriptive Statistics, click on "Statistics" menubar -> "Summaries" -> "Table of
statistics".

ii. For Chi Square Test or Fisher exact test (comparison of categorical variables like male,
female, etc), click "Statistics" menubar -> "Contingency tables" -> "Two way table", or
"Multiway table", as applicable.

iii. For Student's t-test or ANOVA (comparison of means of normally distributed data), click
"Statistics" menubar -> "Means" -> "Single sample t-test", or "Independent sample t-test", or
"Paired t-test", or "One way ANOVA", or "Multi way ANOVA", as applicable.

iv. For Non-parametric tests (comparison of non-normally distributed data), click
"Statistics" menubar -> "Non-parametric tests" -> "Two-sample Wilcoxon test", or
"Single sample Wilcoxon test", or "Paired samples Wilcoxon test", or "Kruskal-Wallis
test", or "Friedman rank sum test", as applicable.
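
For reference, the menu items in (f) correspond roughly to the base R functions sketched below. This is an illustration on simulated data, not a record of what Rcmdr runs internally; all variable names and values are assumptions.

```r
set.seed(1)  # simulated data, for illustration only
dat <- data.frame(
  group  = factor(rep(c("A", "B"), each = 20)),
  sex    = factor(sample(c("female", "male"), 40, replace = TRUE),
                  levels = c("female", "male")),
  weight = c(rnorm(20, mean = 60, sd = 8), rnorm(20, mean = 66, sd = 8))
)
# Three repeated measurements on the same 15 subjects (rows = subjects).
before <- rnorm(15, mean = 220, sd = 20)
mid    <- before - rnorm(15, mean = 5,  sd = 8)
after  <- before - rnorm(15, mean = 10, sd = 8)

# i. Descriptive statistics
summary(dat)
group_means <- tapply(dat$weight, dat$group, mean)

# ii. Two-way contingency table with chi-squared and Fisher exact tests
tab        <- table(dat$group, dat$sex)
chi_res    <- chisq.test(tab)
fisher_res <- fisher.test(tab)

# iii. t-tests and one-way ANOVA
t_single <- t.test(dat$weight, mu = 60)             # single sample
t_indep  <- t.test(weight ~ group, data = dat)      # Welch test by default
t_paired <- t.test(before, after, paired = TRUE)    # paired
anova_1w <- aov(weight ~ group, data = dat)         # one-way ANOVA

# iv. Non-parametric tests
w_two     <- wilcox.test(weight ~ group, data = dat)     # two-sample Wilcoxon
w_single  <- wilcox.test(dat$weight, mu = 60)            # single sample
w_paired  <- wilcox.test(before, after, paired = TRUE)   # paired samples
kw_res    <- kruskal.test(weight ~ group, data = dat)    # Kruskal-Wallis
fried_res <- friedman.test(cbind(before, mid, after))    # Friedman rank sum

t_indep$p.value  # each result object carries its p-value
```

Note that `t.test` does not assume equal variances unless `var.equal = TRUE` is given.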

(g) Basic Graphical analysis: 

i. Box plot: "Graphs" -> "Boxplot" -> choose plot by groups and choose grouping
variable -> select variable to get Boxplot.

ii. Bar graph: "Graphs" -> "Bar graph" -> choose plot by groups and choose grouping
variable -> select variable to get Bar graph.
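
The same two plots can be drawn from the console with base R graphics. A minimal sketch on simulated data follows; the variable names and group sizes are assumptions for illustration.

```r
set.seed(2)  # simulated data, for illustration only
dat <- data.frame(
  group  = factor(rep(c("A", "B"), times = c(12, 18))),
  weight = c(rnorm(12, mean = 60, sd = 8), rnorm(18, mean = 66, sd = 8))
)

# Box plot of a numeric variable, split by the grouping variable.
bp <- boxplot(weight ~ group, data = dat,
              xlab = "Group", ylab = "Weight")

# Bar graph of the number of subjects in each category.
counts <- table(dat$group)
barplot(counts, xlab = "Group", ylab = "Frequency")
```

When run in R Studio the plots appear in the "Plots" pane; when run as a script they are written to a file by the default graphics device.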

(h) Basic Functions for Power Analysis and Sample Size Calculations using R package ‘pwr’: 

https://github.com/heliosdrm/pwr 

Please read the accompanying documentation of the power analysis functions, along the lines of
Cohen (1988), by the maintainer, Helios De Rosario (helios.derosario@gmail.com), at the
CRAN repository (Date/Publication 2018-03-03 22:41:13 UTC).
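
As an illustration of a sample size calculation, the sketch below asks how many subjects per group a two-sample t-test needs. The effect size, significance level, and target power are assumed values chosen for the example, not recommendations. If `pwr` is not installed, the sketch falls back to `power.t.test` from base R, which is a substitute and not part of `pwr`.

```r
effect_size  <- 0.5    # Cohen's d; assumed "medium" effect, for illustration
alpha        <- 0.05   # significance level
target_power <- 0.80   # desired power

if (requireNamespace("pwr", quietly = TRUE)) {
  # Leave n unspecified so that pwr solves for it.
  res <- pwr::pwr.t.test(d = effect_size, sig.level = alpha,
                         power = target_power, type = "two.sample")
} else {
  # Base R fallback: with sd = 1, delta plays the role of Cohen's d.
  res <- stats::power.t.test(delta = effect_size, sd = 1, sig.level = alpha,
                             power = target_power, type = "two.sample")
}

# Round up, since a fraction of a subject cannot be recruited.
n_per_group <- ceiling(res$n)
n_per_group
```

Under these assumptions the required sample size comes out at 64 subjects per group.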

What Statistical Test to Use for my Biostatistics Research Data? 

(a) It depends on 

i. What is the research question? 

ii. Which variables will help answer the research question and which is the dependent 
variable? 

iii. What type of variables are they? 

iv. Should a parametric or non-parametric test be used? 

(b) Key concepts: 

i. VARIABLE: Characteristic which varies between independent subjects. 

ii. CATEGORICAL VARIABLES: variables, such as gender, that take a limited number of values.
They can be further categorised into NOMINAL (naming variables where one category is no
better than another, e.g. hair colour) and ORDINAL (where there is some order to the
categories, e.g. 1st, 2nd, 3rd etc).

iii. CONTINUOUS (SCALE) VARIABLES: Measurements on a proper scale such as age, 
height etc. 

iv. INDEPENDENT VARIABLE: The variable we think has an effect on the dependent 
variable. 

v. DEPENDENT VARIABLE: The variable of interest which could be influenced by 
independent variables. 

vi. PARAMETRIC TESTS: parametric tests rest on several assumptions, including the
assumption that continuous dependent variables are normally distributed. Statistical
packages provide specific tests for this, such as the Shapiro-Wilk normality test, but plotting
a histogram is also a good guide. As long as the histogram of the dependent variable
peaks in the middle and is roughly symmetrical about the mean, we can assume the data
is normally distributed; otherwise the data is non-normally distributed.

Comparing | Dependent variable | Independent variable | Parametric test (dependent variable is normally distributed) | Non-parametric test (dependent variable is non-normally distributed)
----------|--------------------|----------------------|--------------------------------------------------------------|----------------------------------------------------------------------
The means / medians* of two INDEPENDENT groups | Continuous / scale data | Categorical / nominal data | Independent t-test | Mann-Whitney test
The means / medians* of 2 paired (matched) samples, e.g. serum cholesterol level before and after a therapy for one group of patients | Continuous / scale data | Time variable (time 1 = before, time 2 = after) data | Paired t-test | Wilcoxon signed rank test
The means / medians* of three or more independent groups | Continuous / scale data | Categorical / nominal data | One-way ANOVA | Kruskal-Wallis test
Three or more measurements on the same patient | Continuous / scale data | Time variable | Repeated measures ANOVA | Friedman test
Relationship between two continuous variables | Continuous / scale data | Continuous / scale data | Pearson's Correlation Co-efficient | Spearman's Correlation Co-efficient (also use for ordinal data)
Predicting the value of one variable from the value of a predictor variable | Continuous / scale data | Any | Simple Linear Regression | -
Assessing the relationship between two categorical variables | Categorical / nominal data | Categorical / nominal data | - | Chi-squared test

*Mean and Standard Deviation are to be used as measures of the central tendency and dispersion of data when the data is normally distributed. Median and Inter-Quartile Range are to be used as measures of the central tendency and dispersion of data when the data is non-normally distributed.
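
The normality check described under PARAMETRIC TESTS and the choice of summary statistics in the footnote can be sketched as follows. `summarise_var` is a hypothetical helper written for this illustration, and using the Shapiro-Wilk p-value with a 0.05 cut-off as the only criterion is a simplifying assumption; in practice the histogram should be inspected as well.

```r
set.seed(3)
chol <- rnorm(50, mean = 200, sd = 25)  # simulated serum cholesterol values

# Visual check: does the histogram peak in the middle and look symmetrical?
hist(chol, main = "Histogram of serum cholesterol", xlab = "Serum cholesterol")

# Formal check: Shapiro-Wilk normality test.
sw <- shapiro.test(chol)
sw$p.value

# Hypothetical helper: report mean / SD if normality is not rejected,
# otherwise median / inter-quartile range.
summarise_var <- function(x, alpha = 0.05) {
  normal <- shapiro.test(x)$p.value > alpha
  if (normal) {
    list(normal = TRUE, centre = mean(x), spread = sd(x))
  } else {
    list(normal = FALSE, centre = median(x), spread = IQR(x))
  }
}

summarise_var(chol)
```

A small p-value from `shapiro.test` is evidence against normality; a large p-value does not prove that the data is normal, especially in small samples.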

Examples: 

• Are age and serum cholesterol level related? Both are continuous variables so Pearson’s Correlation Co-efficient would be appropriate if the variables are both normally 
distributed. 

• Can height predict weight? You cannot determine height from weight but you could estimate weight given height, so height is the continuous independent variable. Simple
linear regression will help decide if height is a good predictor of weight and produce an equation to predict weight given an individual’s height.

• Is Drug A more effective than Drug B? A researcher would randomly allocate subjects to two groups, with one group receiving Drug A and the other Drug B. Serum
cholesterol is measured before and after the drugs, and the mean / median reduction in cholesterol is compared between the two groups. The dependent variable ‘cholesterol level
reduced’ is continuous. The independent variable is the group the subject is in, which is categorical. If the data is normally distributed, use the independent t-test; if not, use
the Mann-Whitney test.

• Are patients taking treatment A more likely to recover than those on treatment B? Both ‘Treatment’ (A or B) and ‘Recovery’ (Yes or No) are categorical variables so the
Chi-squared test is appropriate.
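
The four examples above can be sketched in base R as follows. All data are simulated and every variable name and parameter value is an assumption made for illustration; the point is only to show which function matches which question.

```r
set.seed(4)  # simulated data, for illustration only
n <- 60
age       <- rnorm(n, mean = 50, sd = 10)
chol      <- 150 + 1.2 * age + rnorm(n, sd = 20)
height    <- rnorm(n, mean = 170, sd = 8)
weight    <- -60 + 0.75 * height + rnorm(n, sd = 6)
drug      <- factor(rep(c("A", "B"), each = n / 2))
reduction <- c(rnorm(n / 2, mean = 30, sd = 10), rnorm(n / 2, mean = 22, sd = 10))
recovered <- factor(sample(c("No", "Yes"), n, replace = TRUE),
                    levels = c("No", "Yes"))

# Are age and serum cholesterol level related? (Pearson correlation;
# use method = "spearman" if the variables are not normally distributed)
r_pearson <- cor.test(age, chol, method = "pearson")

# Can height predict weight? (simple linear regression)
fit <- lm(weight ~ height)
coef(fit)                                          # intercept and slope
predict(fit, newdata = data.frame(height = 175))   # predicted weight at 175

# Is Drug A more effective than Drug B?
t_res  <- t.test(reduction ~ drug)       # if normally distributed
mw_res <- wilcox.test(reduction ~ drug)  # Mann-Whitney test otherwise

# Are patients on treatment A more likely to recover than those on B?
chi_res <- chisq.test(table(drug, recovered))
```

In R the Mann-Whitney test is provided by `wilcox.test` with two independent samples; there is no separate function named after Mann and Whitney in base R.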




